--- Credit Score in Data Analysis Course ---¶
I. Data Description¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
About Dataset¶
The dataset comprises information on 1000 customers, with 84 features derived from their financial transactions and current financial standing. The primary objective is to leverage this dataset for credit risk estimation and predicting potential defaults.
- CUST_ID: Unique customer identifier
Key Target Variables:
- CREDIT_SCORE: Numerical target variable representing the customer's credit score (integer)
- DEFAULT: Binary target variable indicating if the customer has defaulted (1) or not (0)
Description of Features:
INCOME: Total income in the last 12 months
SAVINGS: Total savings in the last 12 months
DEBT: Total existing debt
R_SAVINGS_INCOME: Ratio of savings to income
R_DEBT_INCOME: Ratio of debt to income
R_DEBT_SAVINGS: Ratio of debt to savings
Transaction groups (GROCERIES, CLOTHING, HOUSING, EDUCATION, HEALTH, TRAVEL, ENTERTAINMENT, GAMBLING, UTILITIES, TAX, FINES) each contribute the following features:
T_[GROUP]_6: Total expenditure in that group in the last 6 months
T_[GROUP]_12: Total expenditure in that group in the last 12 months
R_[GROUP]: Ratio of T_[GROUP]_6 to T_[GROUP]_12
R_[GROUP]_INCOME: Ratio of T_[GROUP]_12 to INCOME
R_[GROUP]_SAVINGS: Ratio of T_[GROUP]_12 to SAVINGS
R_[GROUP]_DEBT: Ratio of T_[GROUP]_12 to DEBT
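The naming convention above can be sketched in code. This is just an illustration of how the per-group feature names expand; `group_feature_names` is a hypothetical helper, not part of the dataset.

```python
# Sketch: expand the naming convention for one transaction group.
def group_feature_names(group):
    return [
        f"T_{group}_12",      # total spend, last 12 months
        f"T_{group}_6",       # total spend, last 6 months
        f"R_{group}",         # T_[GROUP]_6 / T_[GROUP]_12
        f"R_{group}_INCOME",  # T_[GROUP]_12 / INCOME
        f"R_{group}_SAVINGS", # T_[GROUP]_12 / SAVINGS
        f"R_{group}_DEBT",    # T_[GROUP]_12 / DEBT
    ]

print(group_feature_names("GROCERIES"))
```

With 11 groups and 6 features each, this accounts for 66 of the 84 features.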
Categorical Features:
- CAT_GAMBLING: Gambling category (none, low, high)
- CAT_DEBT: 1 if the customer has debt; 0 otherwise
- CAT_CREDIT_CARD: 1 if the customer has a credit card; 0 otherwise
- CAT_MORTGAGE: 1 if the customer has a mortgage; 0 otherwise
- CAT_SAVINGS_ACCOUNT: 1 if the customer has a savings account; 0 otherwise
- CAT_DEPENDENTS: 1 if the customer has any dependents; 0 otherwise
Summary:
- Customer ID (string)
- Financial numeric features (Income, Savings, Debt, …)
- Ratios
- Transaction amounts over 6 and 12 months
- Binary categorical indicators (one-hot encoded)

Two target variables:
- CREDIT_SCORE (regression)
- DEFAULT (classification)
1. Loading Dataset¶
df = pd.read_csv(r"./credit_score.csv")
df
| | CUST_ID | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C02COQEVYU | 33269 | 0 | 532304 | 0.0000 | 16.0000 | 1.2000 | 1889 | 945 | 0.5003 | ... | 0.0000 | 0.0625 | High | 1 | 0 | 0 | 0 | 0 | 444 | 1 |
| 1 | C02OZKC0ZF | 77158 | 91187 | 315648 | 1.1818 | 4.0909 | 3.4615 | 5818 | 111 | 0.0191 | ... | 0.7692 | 0.2222 | No | 1 | 0 | 0 | 1 | 0 | 625 | 0 |
| 2 | C03FHP2D0A | 30917 | 21642 | 534864 | 0.7000 | 17.3000 | 24.7142 | 1157 | 860 | 0.7433 | ... | 1.4286 | 0.0578 | High | 1 | 0 | 0 | 1 | 0 | 469 | 1 |
| 3 | C03PVPPHOY | 80657 | 64526 | 629125 | 0.8000 | 7.8000 | 9.7499 | 6857 | 3686 | 0.5376 | ... | 1.2500 | 0.1282 | High | 1 | 0 | 0 | 1 | 0 | 559 | 0 |
| 4 | C04J69MUX0 | 149971 | 1172498 | 2399531 | 7.8182 | 16.0000 | 2.0465 | 1978 | 322 | 0.1628 | ... | 0.1163 | 0.0568 | High | 1 | 1 | 1 | 1 | 1 | 473 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | CZQHJC9HDH | 328892 | 1465066 | 5501471 | 4.4546 | 16.7273 | 3.7551 | 16701 | 10132 | 0.6067 | ... | 0.2041 | 0.0543 | High | 1 | 1 | 1 | 1 | 1 | 418 | 0 |
| 996 | CZRA4MLB0P | 81404 | 88805 | 680837 | 1.0909 | 8.3637 | 7.6667 | 5400 | 1936 | 0.3585 | ... | 0.8333 | 0.1087 | No | 1 | 0 | 0 | 1 | 0 | 589 | 1 |
| 997 | CZSOD1KVFX | 0 | 42428 | 30760 | 3.2379 | 8.1889 | 0.7250 | 0 | 0 | 0.8779 | ... | 0.2500 | 0.3448 | No | 1 | 0 | 0 | 1 | 0 | 499 | 0 |
| 998 | CZWC76UAUT | 36011 | 8002 | 604181 | 0.2222 | 16.7777 | 75.5037 | 1993 | 1271 | 0.6377 | ... | 5.0002 | 0.0662 | No | 1 | 1 | 0 | 1 | 0 | 507 | 0 |
| 999 | CZZV5B3SAL | 44266 | 309859 | 44266 | 6.9999 | 1.0000 | 0.1429 | 1574 | 1264 | 0.8030 | ... | 0.1587 | 1.1111 | No | 1 | 0 | 0 | 1 | 0 | 657 | 0 |
1000 rows × 87 columns
df.columns[1:5]
Index(['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME'], dtype='object')
2. Target Variable¶
df[['DEFAULT', 'CREDIT_SCORE']]
| | DEFAULT | CREDIT_SCORE |
|---|---|---|
| 0 | 1 | 444 |
| 1 | 0 | 625 |
| 2 | 1 | 469 |
| 3 | 0 | 559 |
| 4 | 0 | 473 |
| ... | ... | ... |
| 995 | 0 | 418 |
| 996 | 1 | 589 |
| 997 | 0 | 499 |
| 998 | 0 | 507 |
| 999 | 0 | 657 |
1000 rows × 2 columns
print("Dataset information : ")
df.info()
Dataset information : 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 87 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CUST_ID                  1000 non-null   object 
 1   INCOME                   1000 non-null   int64  
 2   SAVINGS                  1000 non-null   int64  
 3   DEBT                     1000 non-null   int64  
 4   R_SAVINGS_INCOME         1000 non-null   float64
 5   R_DEBT_INCOME            1000 non-null   float64
 6   R_DEBT_SAVINGS           1000 non-null   float64
 7   T_CLOTHING_12            1000 non-null   int64  
 8   T_CLOTHING_6             1000 non-null   int64  
 9   R_CLOTHING               1000 non-null   float64
 10  R_CLOTHING_INCOME        1000 non-null   float64
 11  R_CLOTHING_SAVINGS       1000 non-null   float64
 12  R_CLOTHING_DEBT          1000 non-null   float64
 13  T_EDUCATION_12           1000 non-null   int64  
 14  T_EDUCATION_6            1000 non-null   int64  
 15  R_EDUCATION              1000 non-null   float64
 16  R_EDUCATION_INCOME       1000 non-null   float64
 17  R_EDUCATION_SAVINGS      1000 non-null   float64
 18  R_EDUCATION_DEBT         1000 non-null   float64
 19  T_ENTERTAINMENT_12       1000 non-null   int64  
 20  T_ENTERTAINMENT_6        1000 non-null   int64  
 21  R_ENTERTAINMENT          1000 non-null   float64
 22  R_ENTERTAINMENT_INCOME   1000 non-null   float64
 23  R_ENTERTAINMENT_SAVINGS  1000 non-null   float64
 24  R_ENTERTAINMENT_DEBT     1000 non-null   float64
 25  T_FINES_12               1000 non-null   int64  
 26  T_FINES_6                1000 non-null   int64  
 27  R_FINES                  1000 non-null   float64
 28  R_FINES_INCOME           1000 non-null   float64
 29  R_FINES_SAVINGS          1000 non-null   float64
 30  R_FINES_DEBT             1000 non-null   float64
 31  T_GAMBLING_12            1000 non-null   int64  
 32  T_GAMBLING_6             1000 non-null   int64  
 33  R_GAMBLING               1000 non-null   float64
 34  R_GAMBLING_INCOME        1000 non-null   float64
 35  R_GAMBLING_SAVINGS       1000 non-null   float64
 36  R_GAMBLING_DEBT          1000 non-null   float64
 37  T_GROCERIES_12           1000 non-null   int64  
 38  T_GROCERIES_6            1000 non-null   int64  
 39  R_GROCERIES              1000 non-null   float64
 40  R_GROCERIES_INCOME       1000 non-null   float64
 41  R_GROCERIES_SAVINGS      1000 non-null   float64
 42  R_GROCERIES_DEBT         1000 non-null   float64
 43  T_HEALTH_12              1000 non-null   int64  
 44  T_HEALTH_6               1000 non-null   int64  
 45  R_HEALTH                 1000 non-null   float64
 46  R_HEALTH_INCOME          1000 non-null   float64
 47  R_HEALTH_SAVINGS         1000 non-null   float64
 48  R_HEALTH_DEBT            1000 non-null   float64
 49  T_HOUSING_12             1000 non-null   int64  
 50  T_HOUSING_6              1000 non-null   int64  
 51  R_HOUSING                1000 non-null   float64
 52  R_HOUSING_INCOME         1000 non-null   float64
 53  R_HOUSING_SAVINGS        1000 non-null   float64
 54  R_HOUSING_DEBT           1000 non-null   float64
 55  T_TAX_12                 1000 non-null   int64  
 56  T_TAX_6                  1000 non-null   int64  
 57  R_TAX                    1000 non-null   float64
 58  R_TAX_INCOME             1000 non-null   float64
 59  R_TAX_SAVINGS            1000 non-null   float64
 60  R_TAX_DEBT               1000 non-null   float64
 61  T_TRAVEL_12              1000 non-null   int64  
 62  T_TRAVEL_6               1000 non-null   int64  
 63  R_TRAVEL                 1000 non-null   float64
 64  R_TRAVEL_INCOME          1000 non-null   float64
 65  R_TRAVEL_SAVINGS         1000 non-null   float64
 66  R_TRAVEL_DEBT            1000 non-null   float64
 67  T_UTILITIES_12           1000 non-null   int64  
 68  T_UTILITIES_6            1000 non-null   int64  
 69  R_UTILITIES              1000 non-null   float64
 70  R_UTILITIES_INCOME       1000 non-null   float64
 71  R_UTILITIES_SAVINGS      1000 non-null   float64
 72  R_UTILITIES_DEBT         1000 non-null   float64
 73  T_EXPENDITURE_12         1000 non-null   int64  
 74  T_EXPENDITURE_6          1000 non-null   int64  
 75  R_EXPENDITURE            1000 non-null   float64
 76  R_EXPENDITURE_INCOME     1000 non-null   float64
 77  R_EXPENDITURE_SAVINGS    1000 non-null   float64
 78  R_EXPENDITURE_DEBT       1000 non-null   float64
 79  CAT_GAMBLING             1000 non-null   object 
 80  CAT_DEBT                 1000 non-null   int64  
 81  CAT_CREDIT_CARD          1000 non-null   int64  
 82  CAT_MORTGAGE             1000 non-null   int64  
 83  CAT_SAVINGS_ACCOUNT      1000 non-null   int64  
 84  CAT_DEPENDENTS           1000 non-null   int64  
 85  CREDIT_SCORE             1000 non-null   int64  
 86  DEFAULT                  1000 non-null   int64  
dtypes: float64(51), int64(34), object(2)
memory usage: 679.8+ KB
# Checking data types of each columns
print(df.dtypes)
CUST_ID object
INCOME int64
SAVINGS int64
DEBT int64
R_SAVINGS_INCOME float64
...
CAT_MORTGAGE int64
CAT_SAVINGS_ACCOUNT int64
CAT_DEPENDENTS int64
CREDIT_SCORE int64
DEFAULT int64
Length: 87, dtype: object
# Count total number of each data types
print(df.dtypes.value_counts())
float64    51
int64      34
object      2
Name: count, dtype: int64
# Summary statistics
print("Summary Statistics:")
df.describe()
Summary Statistics:
| | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_INCOME | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.000000 | 1.000000e+03 | 1.000000e+03 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | ... | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 | 1000.000000 | 1000.000000 |
| mean | 121610.019000 | 4.131896e+05 | 7.907180e+05 | 4.063477 | 6.068449 | 5.867252 | 6822.401000 | 3466.320000 | 0.454848 | 0.055557 | ... | 0.943607 | 0.913340 | 0.605276 | 0.944000 | 0.236000 | 0.173000 | 0.993000 | 0.15000 | 586.712000 | 0.284000 |
| std | 113716.699591 | 4.429160e+05 | 9.817904e+05 | 3.968097 | 5.847878 | 16.788356 | 7486.225932 | 5118.942977 | 0.236036 | 0.037568 | ... | 0.168989 | 1.625278 | 1.299382 | 0.230037 | 0.424835 | 0.378437 | 0.083414 | 0.35725 | 63.413882 | 0.451162 |
| min | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003400 | ... | 0.666700 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 300.000000 | 0.000000 |
| 25% | 30450.250000 | 5.971975e+04 | 5.396675e+04 | 1.000000 | 1.454500 | 0.206200 | 1084.500000 | 319.500000 | 0.263950 | 0.029700 | ... | 0.833300 | 0.158700 | 0.100000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 554.750000 | 0.000000 |
| 50% | 85090.000000 | 2.738505e+05 | 3.950955e+05 | 2.545450 | 4.911550 | 2.000000 | 4494.000000 | 1304.000000 | 0.468850 | 0.046800 | ... | 0.909100 | 0.327950 | 0.178600 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 596.000000 | 0.000000 |
| 75% | 181217.500000 | 6.222600e+05 | 1.193230e+06 | 6.307100 | 8.587475 | 4.509600 | 10148.500000 | 4555.500000 | 0.626300 | 0.069400 | ... | 1.000000 | 0.833300 | 0.588200 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 630.000000 | 1.000000 |
| max | 662094.000000 | 2.911863e+06 | 5.968620e+06 | 16.111200 | 37.000600 | 292.842100 | 43255.000000 | 39918.000000 | 1.058300 | 0.251700 | ... | 2.000200 | 10.009900 | 10.005300 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 800.000000 | 1.000000 |
8 rows × 85 columns
II. Data Preprocessing¶
df = df.drop(columns=["CUST_ID"])
We drop CUST_ID since it is only an identifier and carries no predictive information.
1. Detecting Missing values and Duplicates¶
N_A = df.isna().sum().sort_values(ascending=False)
N_A, N_A.sum()
(INCOME 0
SAVINGS 0
DEBT 0
R_SAVINGS_INCOME 0
R_DEBT_INCOME 0
..
CAT_MORTGAGE 0
CAT_SAVINGS_ACCOUNT 0
CAT_DEPENDENTS 0
CREDIT_SCORE 0
DEFAULT 0
Length: 86, dtype: int64,
np.int64(0))
# Check for duplicated rows
duplicate_rows = df[df.duplicated(keep=False)]  # all rows involved in a duplication
n_duplicates = df.duplicated().sum()            # extra copies only
print("Sum total duplicates = ", n_duplicates)
Sum total duplicates = 0
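The distinction between `keep='first'` (the default) and `keep=False` matters for the counts above. A minimal sketch on a toy frame (hypothetical data, not the credit dataset):

```python
import pandas as pd

# Toy frame with one exact duplicate row
toy = pd.DataFrame({"a": [1, 1, 2], "b": [10, 10, 20]})

# Default keep='first': only the *extra* copies count as duplicates
print(toy.duplicated().sum())            # 1

# keep=False: *all* rows involved in a duplication are flagged
print(toy.duplicated(keep=False).sum())  # 2

# Dropping duplicates keeps the first occurrence of each row
print(len(toy.drop_duplicates()))        # 2
```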
2. Numerical conversion¶
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns:", cat_cols)
Categorical columns: ['CAT_GAMBLING']
df["CAT_GAMBLING"].value_counts()
CAT_GAMBLING
No      620
High    264
Low     116
Name: count, dtype: int64
gambling = {'High':2,'Low':1,'No':0}
df["CAT_GAMBLING"] = df["CAT_GAMBLING"].map(gambling)
df.head()
| | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33269 | 0 | 532304 | 0.0000 | 16.0000 | 1.2000 | 1889 | 945 | 0.5003 | 0.0568 | ... | 0.0000 | 0.0625 | 2 | 1 | 0 | 0 | 0 | 0 | 444 | 1 |
| 1 | 77158 | 91187 | 315648 | 1.1818 | 4.0909 | 3.4615 | 5818 | 111 | 0.0191 | 0.0754 | ... | 0.7692 | 0.2222 | 0 | 1 | 0 | 0 | 1 | 0 | 625 | 0 |
| 2 | 30917 | 21642 | 534864 | 0.7000 | 17.3000 | 24.7142 | 1157 | 860 | 0.7433 | 0.0374 | ... | 1.4286 | 0.0578 | 2 | 1 | 0 | 0 | 1 | 0 | 469 | 1 |
| 3 | 80657 | 64526 | 629125 | 0.8000 | 7.8000 | 9.7499 | 6857 | 3686 | 0.5376 | 0.0850 | ... | 1.2500 | 0.1282 | 2 | 1 | 0 | 0 | 1 | 0 | 559 | 0 |
| 4 | 149971 | 1172498 | 2399531 | 7.8182 | 16.0000 | 2.0465 | 1978 | 322 | 0.1628 | 0.0132 | ... | 0.1163 | 0.0568 | 2 | 1 | 1 | 1 | 1 | 1 | 473 | 0 |
5 rows × 86 columns
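One caveat with `.map()`: any label missing from the mapping dictionary silently becomes NaN. A small defensive sketch (the `"Medium"` label is hypothetical, only there to trigger the failure mode):

```python
import pandas as pd

# Hypothetical column containing an unexpected label
s = pd.Series(["High", "Low", "No", "Medium"])
gambling = {"High": 2, "Low": 1, "No": 0}

# .map() silently yields NaN for 'Medium'
mapped = s.map(gambling)
print(mapped.isna().sum())  # 1

# A quick guard catches unmapped labels before they propagate
unmapped = set(s.unique()) - set(gambling)
print(unmapped)  # {'Medium'}
```

In our case `value_counts()` showed only the three expected labels, so the mapping is safe.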
3. Variance Thresholding¶
variances = df.var().sort_values()
plt.figure(figsize=(10,4))
plt.plot(variances.values)
plt.title("Variance of Features")
plt.ylabel("Variance")
plt.show()
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.00001)
original_columns = df.columns
selector.fit(df)
remaining_columns = df.columns[selector.get_support(indices=True)]
removed_columns = set(original_columns) - set(remaining_columns)
print("Removed columns:", removed_columns)
print("Remaining:", len(remaining_columns))
df.head()
Removed columns: {'R_HOUSING', 'R_UTILITIES', 'R_EDUCATION', 'R_FINES_INCOME', 'R_FINES_DEBT'}
Remaining: 81
| | INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33269 | 0 | 532304 | 0.0000 | 16.0000 | 1.2000 | 1889 | 945 | 0.5003 | 0.0568 | ... | 0.0000 | 0.0625 | 2 | 1 | 0 | 0 | 0 | 0 | 444 | 1 |
| 1 | 77158 | 91187 | 315648 | 1.1818 | 4.0909 | 3.4615 | 5818 | 111 | 0.0191 | 0.0754 | ... | 0.7692 | 0.2222 | 0 | 1 | 0 | 0 | 1 | 0 | 625 | 0 |
| 2 | 30917 | 21642 | 534864 | 0.7000 | 17.3000 | 24.7142 | 1157 | 860 | 0.7433 | 0.0374 | ... | 1.4286 | 0.0578 | 2 | 1 | 0 | 0 | 1 | 0 | 469 | 1 |
| 3 | 80657 | 64526 | 629125 | 0.8000 | 7.8000 | 9.7499 | 6857 | 3686 | 0.5376 | 0.0850 | ... | 1.2500 | 0.1282 | 2 | 1 | 0 | 0 | 1 | 0 | 559 | 0 |
| 4 | 149971 | 1172498 | 2399531 | 7.8182 | 16.0000 | 2.0465 | 1978 | 322 | 0.1628 | 0.0132 | ... | 0.1163 | 0.0568 | 2 | 1 | 1 | 1 | 1 | 1 | 473 | 0 |
5 rows × 86 columns
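Note that `selector.fit(df)` only *identifies* the low-variance columns; `df` itself is unchanged, which is why `df.head()` still shows 86 columns. To actually shrink the frame one can index with the surviving columns. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic frame: one normal column plus one constant column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "varied": rng.normal(size=100),
    "constant": np.zeros(100),  # zero variance -> should be flagged
})

selector = VarianceThreshold(threshold=1e-5)
selector.fit(toy)
kept = toy.columns[selector.get_support(indices=True)]

# Indexing with the kept columns actually drops the flagged ones
toy_reduced = toy[kept]
print(toy_reduced.shape)  # (100, 1)
```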
4. Outlier detection¶
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
cols = num_cols[:8]
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
for idx, col in enumerate(cols):
row = idx // 4
col_pos = idx % 4
axes[row, col_pos].boxplot(df[col], vert=True, patch_artist=True,
boxprops=dict(facecolor='orange'))
axes[row, col_pos].set_title(f"Box Plot for {col}")
axes[row, col_pos].set_ylabel(col)
plt.tight_layout()
plt.show()
In this situation, we decide not to remove the outliers, because they represent exactly the customers we want to detect:
for example, people with extremely high debt or with zero savings. These outliers are the high-risk cases.
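Even if we keep the outliers, it can be useful to flag them for inspection. A sketch of the standard 1.5×IQR rule on a hypothetical skewed column:

```python
import numpy as np
import pandas as pd

# Hypothetical heavy-tailed column: mostly 1s plus two extreme values
s = pd.Series(np.concatenate([np.ones(98), [100.0, 200.0]]))

# 1.5 * IQR fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Flagged for inspection, not removed
print(is_outlier.sum())  # 2
```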
Histogram distribution¶
plots_per_page = 12
cols_per_row = 4
rows_per_page = 3
for i in range(0, len(num_cols), plots_per_page):
subset = num_cols[i:i+plots_per_page]
fig, axes = plt.subplots(rows_per_page, cols_per_row, figsize=(20, 12))
axes = axes.flatten()
for j, col in enumerate(subset):
axes[j].hist(df[col], bins=30, color='skyblue', edgecolor='black')
axes[j].set_title(col)
# Hide unused axes
for k in range(len(subset), plots_per_page):
axes[k].set_visible(False)
plt.tight_layout()
plt.show()
Feature Scaling¶
Group to do log-transformation
# Group 1 — Heavy-tailed monetary variables
money_cols = ["INCOME", "SAVINGS", "DEBT"]
# Group 2 — All T_* expenditure features
t_cols = [col for col in df.columns if col.startswith("T_")]
# Group 3 — Extreme ratio variables (_SAVINGS, _DEBT)
extreme_ratio_cols = [col for col in df.columns
if col.endswith("_SAVINGS") or col.endswith("_DEBT")]
# Group 4 — Moderately skewed (_INCOME ratios)
income_ratio_cols = [col for col in df.columns if col.endswith("_INCOME")]
# Group 5 — Balanced ratio variables (no transform)
balanced_ratio_cols = [
col for col in df.columns
if col.startswith("R_")
and col not in extreme_ratio_cols
and col not in income_ratio_cols
]
# Group 6 — Binary categorical variables
binary_cols = [col for col in df.columns if col.startswith("CAT_")]
# Group 7 — Targets
target_cols = ["CREDIT_SCORE", "DEFAULT"]
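A quick sanity check on the grouping logic, illustrated on a toy column list (not the full 86 columns): the suffix and prefix rules overlap for `CAT_DEBT`, which ends with `_DEBT` and also starts with `CAT_`. That overlap is exactly why `CAT_DEBT` is excluded from the transform list below.

```python
# Toy column list exercising each rule
columns = ["INCOME", "T_TAX_12", "R_TAX", "R_TAX_INCOME", "R_TAX_DEBT",
           "CAT_DEBT", "CREDIT_SCORE", "DEFAULT"]

money    = [c for c in columns if c in ("INCOME", "SAVINGS", "DEBT")]
t_cols   = [c for c in columns if c.startswith("T_")]
extreme  = [c for c in columns if c.endswith(("_SAVINGS", "_DEBT"))]
income_r = [c for c in columns if c.endswith("_INCOME")]
balanced = [c for c in columns if c.startswith("R_")
            and c not in extreme and c not in income_r]
binary   = [c for c in columns if c.startswith("CAT_")]
targets  = [c for c in columns if c in ("CREDIT_SCORE", "DEFAULT")]

groups = [money, t_cols, extreme, income_r, balanced, binary, targets]
covered = sum(groups, [])
# CAT_DEBT lands in two groups (extreme '_DEBT' suffix and binary 'CAT_' prefix)
print(len(covered), len(set(covered)))  # 9 8
```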
def plot_group(columns, title, df, cols_per_row=4):
if len(columns) == 0:
return
rows = (len(columns) + cols_per_row - 1) // cols_per_row
fig, axes = plt.subplots(rows, cols_per_row, figsize=(20, rows * 4))
axes = axes.flatten()
for i, col in enumerate(columns):
axes[i].hist(df[col], bins=30, color='skyblue', edgecolor='black')
axes[i].set_title(col)
# Hide empty axes
for j in range(len(columns), len(axes)):
axes[j].set_visible(False)
fig.suptitle(title, fontsize=20)
plt.tight_layout()
plt.show()
plot_group(money_cols, "Group 1 — Heavy-Tailed Monetary Variables\n", df)
plot_group(t_cols, "Group 2 — T_* Expenditure Variables\n", df)
plot_group(extreme_ratio_cols, "Group 3 — Extreme Ratio Variables (_SAVINGS & _DEBT)\n", df)
plot_group(income_ratio_cols, "Group 4 — Moderately Skewed Income Ratios\n", df)
plot_group(balanced_ratio_cols, "Group 5 — Balanced Ratio Variables\n", df)
plot_group(binary_cols, "Group 6 — Binary Variables\n", df)
plot_group(target_cols, "Group 7 — Target Variables\n", df)
# 1. Heavy monetary variables
must_transform = set(money_cols)
# 2. All T_* expenditure variables
must_transform.update(t_cols)
# 3. Extreme *_SAVINGS and *_DEBT ratios (exclude binary CAT_DEBT)
must_transform.update([c for c in extreme_ratio_cols if c != "CAT_DEBT"])
# 4. Strongly skewed income ratios (subset of income_ratio_cols)
strong_income_transform = [
"R_SAVINGS_INCOME", "R_DEBT_INCOME", "R_EDUCATION_INCOME",
"R_ENTERTAINMENT_INCOME", "R_GAMBLING_INCOME",
"R_HEALTH_INCOME", "R_TAX_INCOME", "R_TRAVEL_INCOME"
]
must_transform.update(strong_income_transform)
Does the transform really help PCA? Yes: PCA is driven by variance, so a few heavy-tailed features would otherwise dominate the principal components. Reducing skew before standardization lets each feature contribute more evenly.
5. Log transformation¶
- Log-transform fixes skew
- Standardization fixes scale
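We use `log1p` rather than plain `log` because several columns (INCOME, SAVINGS, DEBT) contain exact zeros, where `log` is undefined. A quick sketch:

```python
import numpy as np

# log1p(x) = log(1 + x) stays finite at zero, unlike log(x)
x = np.array([0.0, 9.0, 99.0])
y = np.log1p(x)
print(y[0])                         # 0.0 -- zero maps to zero
print(np.allclose(np.expm1(y), x))  # True -- expm1 inverts log1p
```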
# Apply log1p transform
df_log = df.copy()
df_log[list(must_transform)] = np.log1p(df[list(must_transform)])
def compare_hist(col):
fig, axes = plt.subplots(1, 2, figsize=(12,4))
axes[0].hist(df[col], bins=40, color='skyblue', edgecolor='black')
axes[0].set_title(f"{col} — BEFORE")
axes[1].hist(df_log[col], bins=40, color='salmon', edgecolor='black')
axes[1].set_title(f"{col} — AFTER log1p")
plt.tight_layout()
plt.show()
for col in ["INCOME", "T_EXPENDITURE_12", "R_DEBT_INCOME", "R_FINES_SAVINGS"]:
compare_hist(col)
All four distributions become noticeably more symmetric after the transform.
df_original = df.copy()
df[list(must_transform)] = np.log1p(df[list(must_transform)])
df.shape
(1000, 86)
*** We keep df_original so we can come back later and compare results between the original and the transformed features.
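The visual comparison can also be quantified with skewness. A sketch on synthetic heavy-tailed (lognormal) data, standing in for columns like INCOME:

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed sample
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=1000))

# Skewness before vs after log1p; closer to 0 is more symmetric
skew_before = raw.skew()
skew_after = np.log1p(raw).skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```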
6. Correlation Analysis¶
We do correlation analysis before scaling.
pearson_corr = df.corr(method='pearson')
plt.figure(figsize=(16, 12))
sns.heatmap(pearson_corr, cmap='coolwarm', center=0)
plt.title("Pearson Correlation Heatmap", fontsize=20)
plt.show()
# Zoom in on the first 5 columns
plt.figure(figsize=(6, 4))
sns.heatmap(pearson_corr.iloc[:5, :5], annot=True)
plt.show()
Highly correlated pairs
threshold = 0.85
high_corr_pairs = (
pearson_corr
.where(np.triu(np.ones(pearson_corr.shape), k=1).astype(bool))
.stack()
.reset_index()
)
high_corr_pairs.columns = ['Feature_1', 'Feature_2', 'Correlation']
# Filter by threshold and remove old index numbers
high_corr_pairs = (
high_corr_pairs[high_corr_pairs['Correlation'].abs() >= threshold]
.sort_values(by="Correlation", ascending=False)
.reset_index(drop=True)
)
print(high_corr_pairs[1:5])
        Feature_1      Feature_2  Correlation
1  T_EDUCATION_12  T_EDUCATION_6     0.999907
2   T_GAMBLING_12   T_GAMBLING_6     0.999660
3  T_UTILITIES_12  T_UTILITIES_6     0.999219
4        T_TAX_12        T_TAX_6     0.999158
PCA will deal with multicollinearity.
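Why PCA handles these near-duplicate pairs (like `T_*_12` vs `T_*_6`): a collinear pair only contributes one effective dimension, so it collapses into a single component. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two nearly collinear features plus one independent feature
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([
    a,                                       # feature 1
    a + rng.normal(scale=0.01, size=500),    # near-copy of feature 1
    rng.normal(size=500),                    # independent feature
])

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# The collinear pair folds into one component; the third ratio is ~0
print(pca.explained_variance_ratio_.round(3))
```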
III. EDA¶
Target Variable Analysis¶
# Select and Count the number of numerical columns and categorical or objects columns
object_cols = df.select_dtypes(include=['object']).columns.tolist()
number_object = len(object_cols)
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
number_numerical = len(numeric_cols)
print(f"\nObject columns in the dataset: {object_cols} \n==> total {number_object}")
print(f"Numerical columns in the dataset: {numeric_cols} \n==> total {number_numerical}")
Object columns in the dataset: [] ==> total 0 Numerical columns in the dataset: ['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME', 'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING', 'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT', 'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME', 'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12', 'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME', 'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12', 'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS', 'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING', 'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT', 'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME', 'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6', 'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT', 'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME', 'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX', 'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12', 'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS', 'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES', 'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT', 'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE', 'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT', 'CAT_GAMBLING', 'CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS', 'CREDIT_SCORE', 'DEFAULT'] ==> total 86
# Distribution of CREDIT_SCORE
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Histogram with KDE
sns.histplot(df['CREDIT_SCORE'], kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of CREDIT_SCORE')
axes[0].axvline(df['CREDIT_SCORE'].mean(), color='red', linestyle='--', label=f"Mean: {df['CREDIT_SCORE'].mean():.1f}")
axes[0].axvline(df['CREDIT_SCORE'].median(), color='green', linestyle='--', label=f"Median: {df['CREDIT_SCORE'].median():.1f}")
axes[0].legend()
# Box plot
sns.boxplot(y=df['CREDIT_SCORE'], ax=axes[1], color='steelblue')
axes[1].set_title('Box Plot of CREDIT_SCORE')
# CREDIT_SCORE by DEFAULT
sns.boxplot(x='DEFAULT', y='CREDIT_SCORE', hue='DEFAULT', data=df, ax=axes[2], palette='Set2', legend=False)
axes[2].set_title('CREDIT_SCORE by DEFAULT Status')
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(['No Default (0)', 'Default (1)'])
plt.tight_layout()
plt.show()
# Statistics
print("CREDIT_SCORE Statistics:")
print(f" Skewness: {df['CREDIT_SCORE'].skew():.3f}")
print(f" Kurtosis: {df['CREDIT_SCORE'].kurtosis():.3f}")
print(f"\nCREDIT_SCORE by DEFAULT:")
print(df.groupby('DEFAULT')['CREDIT_SCORE'].describe())
CREDIT_SCORE Statistics:
Skewness: -1.005
Kurtosis: 2.033
CREDIT_SCORE by DEFAULT:
count mean std min 25% 50% 75% max
DEFAULT
0 716.0 600.008380 53.299627 385.0 566.00 606.5 635.25 800.0
1 284.0 553.190141 73.776756 300.0 516.75 565.5 609.00 691.0
# Credit Score Segmentation
def credit_score_category(score):
if score < 500:
return 'Poor'
elif score < 600:
return 'Fair'
elif score < 700:
return 'Good'
else:
return 'Excellent'
df['CREDIT_CATEGORY'] = df['CREDIT_SCORE'].apply(credit_score_category)
# Analysis by category
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Count by category
category_order = ['Poor', 'Fair', 'Good', 'Excellent']
sns.countplot(x='CREDIT_CATEGORY', hue='CREDIT_CATEGORY', data=df, order=category_order, ax=axes[0], palette='RdYlGn', legend=False)
axes[0].set_title('Distribution by Credit Score Category')
axes[0].set_xlabel('Credit Category')
# Default rate by category
default_rate = df.groupby('CREDIT_CATEGORY')['DEFAULT'].mean().reindex(category_order)
default_rate.plot(kind='bar', ax=axes[1], color=['red', 'orange', 'yellowgreen', 'green'])
axes[1].set_title('Default Rate by Credit Score Category')
axes[1].set_ylabel('Default Rate')
axes[1].set_xticklabels(category_order, rotation=0)
plt.tight_layout()
plt.show()
# Cross-tabulation
print("Cross-tabulation: CREDIT_CATEGORY vs DEFAULT")
print(pd.crosstab(df['CREDIT_CATEGORY'], df['DEFAULT'], margins=True, normalize='index').round(3))
Cross-tabulation: CREDIT_CATEGORY vs DEFAULT
DEFAULT              0      1
CREDIT_CATEGORY
Excellent        1.000  0.000
Fair             0.677  0.323
Good             0.814  0.186
Poor             0.333  0.667
All              0.716  0.284
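The `credit_score_category` binning above can also be expressed with `pd.cut`; the bins are left-closed (`right=False`) to match the `score < threshold` logic. A sketch on a few example scores:

```python
import pandas as pd

# Same thresholds as credit_score_category(): <500 Poor, <600 Fair,
# <700 Good, else Excellent
scores = pd.Series([444, 625, 469, 559, 800])
cats = pd.cut(scores,
              bins=[-float("inf"), 500, 600, 700, float("inf")],
              labels=["Poor", "Fair", "Good", "Excellent"],
              right=False)
print(list(cats))  # ['Poor', 'Good', 'Poor', 'Fair', 'Excellent']
```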
Bivariate Analysis - Features vs Target Variables¶
# Top correlated features with CREDIT_SCORE
credit_corr = df[numeric_cols].corr()['CREDIT_SCORE'].drop('CREDIT_SCORE').sort_values(key=abs, ascending=False)
print("Top 15 features correlated with CREDIT_SCORE:")
print(credit_corr.head(15))
print("\n" + "="*50)
# Top correlated features with DEFAULT
default_corr = df[numeric_cols].corr()['DEFAULT'].drop('DEFAULT').sort_values(key=abs, ascending=False)
print("\nTop 15 features correlated with DEFAULT:")
print(default_corr.head(15))
Top 15 features correlated with CREDIT_SCORE:
R_DEBT_INCOME          -0.757480
R_DEBT_SAVINGS         -0.625888
R_EXPENDITURE_DEBT      0.453632
R_ENTERTAINMENT_DEBT    0.350135
R_GROCERIES_DEBT        0.348084
DEBT                   -0.338848
R_UTILITIES_DEBT        0.333114
DEFAULT                -0.333092
R_TRAVEL_DEBT           0.314333
R_TAX_DEBT              0.304005
R_HEALTH_DEBT           0.301370
R_EDUCATION_INCOME     -0.300259
R_CLOTHING_DEBT         0.280410
R_HOUSING_DEBT          0.270634
R_SAVINGS_INCOME        0.241467
Name: CREDIT_SCORE, dtype: float64

==================================================

Top 15 features correlated with DEFAULT:
CREDIT_SCORE           -0.333092
R_DEBT_INCOME           0.250579
R_DEBT_SAVINGS          0.197758
R_EXPENDITURE_DEBT     -0.177296
R_ENTERTAINMENT_DEBT   -0.158330
R_TAX_DEBT             -0.142238
R_GROCERIES_DEBT       -0.140743
R_UTILITIES_DEBT       -0.138511
R_HEALTH_DEBT          -0.137009
R_TRAVEL_DEBT          -0.132935
R_CLOTHING_DEBT        -0.128037
R_GROCERIES_SAVINGS     0.121917
CAT_CREDIT_CARD         0.119993
R_EDUCATION_SAVINGS     0.108363
R_UTILITIES_SAVINGS     0.099182
Name: DEFAULT, dtype: float64
# Visualize top correlated features
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Top 10 for CREDIT_SCORE
top_credit = credit_corr.head(10)
colors = ['green' if x > 0 else 'red' for x in top_credit.values]
top_credit.plot(kind='barh', ax=axes[0], color=colors)
axes[0].set_title('Top 10 Features Correlated with CREDIT_SCORE')
axes[0].set_xlabel('Correlation')
# Top 10 for DEFAULT
top_default = default_corr.head(10)
colors = ['green' if x > 0 else 'red' for x in top_default.values]
top_default.plot(kind='barh', ax=axes[1], color=colors)
axes[1].set_title('Top 10 Features Correlated with DEFAULT')
axes[1].set_xlabel('Correlation')
plt.tight_layout()
plt.show()
# Scatter plots for top 6 features vs CREDIT_SCORE
top_features = credit_corr.head(6).index.tolist()
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
for i, feature in enumerate(top_features):
axes[i].scatter(df[feature], df['CREDIT_SCORE'], alpha=0.5, c=df['DEFAULT'], cmap='coolwarm')
axes[i].set_xlabel(feature)
axes[i].set_ylabel('CREDIT_SCORE')
axes[i].set_title(f'{feature} vs CREDIT_SCORE (r={credit_corr[feature]:.3f})')
plt.tight_layout()
plt.show()
# Box plots for top 6 features by DEFAULT status
top_default_features = default_corr.head(6).index.tolist()
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
for i, feature in enumerate(top_default_features):
    sns.boxplot(x='DEFAULT', y=feature, hue='DEFAULT', data=df, ax=axes[i], palette='Set2', legend=False)
    axes[i].set_title(f'{feature} by DEFAULT (r={default_corr[feature]:.3f})')
    axes[i].set_xticks([0, 1])
    axes[i].set_xticklabels(['No Default', 'Default'])
plt.tight_layout()
plt.show()
Categorical Analysis - CAT_GAMBLING vs Targets¶
# # CAT_GAMBLING analysis
# from scipy.stats import chi2_contingency
# fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# # Count by gambling category
# gambling_order = ['none', 'low', 'high']  # category labels per the data description
# sns.countplot(x='CAT_GAMBLING', hue='DEFAULT', data=df, order=gambling_order, ax=axes[0], palette='Set2')
# axes[0].set_title('DEFAULT by CAT_GAMBLING')
# axes[0].legend(title='DEFAULT', labels=['No Default', 'Default'])
# # Default rate by gambling category
# default_by_gambling = df.groupby('CAT_GAMBLING')['DEFAULT'].mean().reindex(gambling_order)
# default_by_gambling.plot(kind='bar', ax=axes[1], color=['green', 'orange', 'red'])
# axes[1].set_title('Default Rate by CAT_GAMBLING')
# axes[1].set_ylabel('Default Rate')
# axes[1].set_xticklabels(gambling_order, rotation=0)
# # Credit score by gambling category
# sns.boxplot(x='CAT_GAMBLING', y='CREDIT_SCORE', data=df, order=gambling_order, ax=axes[2], palette='RdYlGn_r')
# axes[2].set_title('CREDIT_SCORE by CAT_GAMBLING')
# plt.tight_layout()
# plt.show()
# # Chi-square test
# contingency_table = pd.crosstab(df['CAT_GAMBLING'], df['DEFAULT'])
# chi2, p_value, dof, expected = chi2_contingency(contingency_table)
# print("Cross-tabulation: CAT_GAMBLING vs DEFAULT")
# print(pd.crosstab(df['CAT_GAMBLING'], df['DEFAULT'], margins=True))
# print(f"\nChi-square test: chi2={chi2:.3f}, p-value={p_value:.4f}")
# print(f"Statistically significant: {'Yes' if p_value < 0.05 else 'No'}")
# print("\n" + "="*50)
# print("\nMean CREDIT_SCORE by CAT_GAMBLING:")
# print(df.groupby('CAT_GAMBLING')['CREDIT_SCORE'].agg(['mean', 'median', 'std']).round(2))
# Analysis of binary categorical features vs DEFAULT
binary_cats = ['CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS']
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
for i, cat in enumerate(binary_cats):
    # Default rate by category
    default_rate = df.groupby(cat)['DEFAULT'].mean()
    default_rate.plot(kind='bar', ax=axes[i], color=['steelblue', 'coral'])
    axes[i].set_title(f'Default Rate by {cat}')
    axes[i].set_ylabel('Default Rate')
    axes[i].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)
    # Add values on bars
    for j, v in enumerate(default_rate.values):
        axes[i].text(j, v + 0.01, f'{v:.2%}', ha='center')
axes[5].set_visible(False)
plt.tight_layout()
plt.show()
# Summary statistics
print("Default Rate by Binary Categorical Features:")
for cat in binary_cats:
    rate = df.groupby(cat)['DEFAULT'].mean()
    print(f"\n{cat}:")
    print(f"  No (0): {rate.get(0, 0):.2%}")
    print(f"  Yes (1): {rate.get(1, 0):.2%}")
Default Rate by Binary Categorical Features:

CAT_DEBT:
  No (0): 21.43%
  Yes (1): 28.81%

CAT_CREDIT_CARD:
  No (0): 25.39%
  Yes (1): 38.14%

CAT_MORTGAGE:
  No (0): 27.57%
  Yes (1): 32.37%

CAT_SAVINGS_ACCOUNT:
  No (0): 42.86%
  Yes (1): 28.30%

CAT_DEPENDENTS:
  No (0): 27.18%
  Yes (1): 35.33%
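The rate differences above can also be tested for statistical significance with a chi-square test of independence, as the commented-out CAT_GAMBLING cell does. A minimal standalone sketch on synthetic data (the feature name and the 25%/38% rates merely mirror the CAT_CREDIT_CARD output above, they are not the course dataset):

```python
# Sketch (synthetic data, not the course dataset): chi-square test of
# independence between a binary feature and DEFAULT.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
demo = pd.DataFrame({'CAT_CREDIT_CARD': rng.integers(0, 2, size=1000)})
# Default probability depends on the feature (25% vs 38%, mirroring the output above)
p = np.where(demo['CAT_CREDIT_CARD'] == 1, 0.38, 0.25)
demo['DEFAULT'] = (rng.random(1000) < p).astype(int)

table = pd.crosstab(demo['CAT_CREDIT_CARD'], demo['DEFAULT'])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value here would indicate the default rate genuinely differs between the two groups rather than by chance.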
Feature Correlation Analysis - Identifying Redundant Features¶
# Find highly correlated feature pairs (potential redundancy)
corr_matrix = df[numeric_cols].corr().abs()
# Get upper triangle
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find pairs with correlation > 0.9
high_corr_pairs = []
for col in upper_tri.columns:
    for idx in upper_tri.index:
        if upper_tri.loc[idx, col] > 0.9:
            high_corr_pairs.append((idx, col, upper_tri.loc[idx, col]))
# Sort by correlation
high_corr_pairs.sort(key=lambda x: x[2], reverse=True)
print(f"Found {len(high_corr_pairs)} highly correlated feature pairs (r > 0.9):")
print("="*70)
for feat1, feat2, corr_val in high_corr_pairs[:20]:  # Show top 20
    print(f"{feat1} <-> {feat2}: {corr_val:.3f}")
Found 20 highly correlated feature pairs (r > 0.9):
======================================================================
T_HOUSING_12 <-> T_HOUSING_6: 1.000
T_EDUCATION_12 <-> T_EDUCATION_6: 1.000
T_GAMBLING_12 <-> T_GAMBLING_6: 1.000
T_UTILITIES_12 <-> T_UTILITIES_6: 0.999
T_TAX_12 <-> T_TAX_6: 0.999
T_ENTERTAINMENT_12 <-> T_ENTERTAINMENT_6: 0.998
T_GROCERIES_12 <-> T_GROCERIES_6: 0.996
T_TRAVEL_12 <-> T_TRAVEL_6: 0.991
T_EXPENDITURE_12 <-> T_EXPENDITURE_6: 0.989
T_GAMBLING_6 <-> CAT_GAMBLING: 0.982
T_GAMBLING_12 <-> CAT_GAMBLING: 0.982
INCOME <-> T_ENTERTAINMENT_12: 0.965
T_HEALTH_12 <-> T_UTILITIES_12: 0.960
T_HEALTH_12 <-> T_UTILITIES_6: 0.959
INCOME <-> T_ENTERTAINMENT_6: 0.958
R_GROCERIES_DEBT <-> R_UTILITIES_DEBT: 0.943
INCOME <-> T_CLOTHING_12: 0.940
T_CLOTHING_12 <-> T_ENTERTAINMENT_12: 0.914
T_CLOTHING_12 <-> T_ENTERTAINMENT_6: 0.913
T_GROCERIES_12 <-> T_EXPENDITURE_12: 0.901
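A common follow-up is to prune one feature from each highly correlated pair before modeling. A minimal sketch using the same upper-triangle trick as above, demonstrated on a tiny synthetic frame (column names `A`/`B`/`C` and the 0.9 threshold are illustrative, not from the course dataset):

```python
# Sketch: drop one column of every pair whose |r| exceeds a threshold.
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Return (reduced df, dropped column names) after pruning redundant columns."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'A': a,
    'B': a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of A
    'C': rng.normal(size=200),                      # independent
})
reduced, dropped = drop_highly_correlated(demo)
print(dropped)  # ['B'] — B is nearly a linear copy of A
```

In this dataset, the same idea would keep only one of each `T_{GROUP}_6`/`T_{GROUP}_12` pair, since their correlations are near 1.0.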
Principal component analysis (PCA)¶
RobustScaler centers each feature on its median and scales by the interquartile range, so it handles extreme values (outliers) better than mean/std scaling.
from sklearn.preprocessing import RobustScaler
x = df.drop(columns=['CREDIT_SCORE', 'DEFAULT', 'CREDIT_CATEGORY'])
y_class = df['DEFAULT']
y_reg = df['CREDIT_SCORE']
y_seg = df['CREDIT_CATEGORY']
scaler = RobustScaler()
x_scaled = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)
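A quick standalone illustration of why this matters (toy data, not the course dataset): with one extreme outlier, RobustScaler's median/IQR statistics barely move, while StandardScaler's mean and std are dragged toward the outlier.

```python
# Toy comparison: RobustScaler (median/IQR) vs StandardScaler (mean/std)
# on a column containing one extreme outlier.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

robust = RobustScaler().fit_transform(data)      # (x - median) / IQR = (x - 3) / 2
standard = StandardScaler().fit_transform(data)  # (x - mean) / std, both outlier-driven

print(robust.ravel())    # median maps exactly to 0; typical values stay near ±1
print(standard.ravel())  # the four typical values are squashed together
```

Because the median (3.0) and IQR (2.0) ignore the 1000, the robust-scaled typical values remain on a sensible scale, which is why RobustScaler is a reasonable choice before PCA on heavy-tailed financial features.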
Fit PCA on x_scaled
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(x_scaled)
PCA()
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Scree Plot")
plt.grid(True)
plt.show()
Based on the PCA scree plot:
- The first component (PC1) captures ~50% of the variance.
- By 5 components, more than 80% of the variance is captured.
- By 10 components, the cumulative explained variance reaches ~90% and the curve starts to flatten. We therefore keep 10 principal components.
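The visual read-off from the scree plot can also be done programmatically: pick the smallest number of components whose cumulative explained variance reaches a target. A standalone sketch on synthetic data (in the notebook, the full-fit `pca` on `x_scaled` would be used instead):

```python
# Sketch: smallest n_components reaching a cumulative-variance target (90%).
# Synthetic data: 5 informative directions embedded in 30 noisy features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.05 * rng.normal(size=(500, 30))

pca_full = PCA().fit(X)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
# argmax returns the first index where the condition holds (assumes the
# target is reachable); +1 converts the 0-based index to a component count.
n_keep = int(np.argmax(cumvar >= 0.90) + 1)
print(n_keep)
```

Here the 5 latent directions dominate the variance, so `n_keep` lands at or below 5; on the credit data the same rule would confirm the choice of 10 components.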
n_components = 10
Fit PCA with the chosen number of components
pca = PCA(n_components=n_components)
x_pca = pca.fit_transform(x_scaled)
pca.explained_variance_ratio_
array([0.49960918, 0.12467667, 0.10535623, 0.05484544, 0.03530568,
0.026041 , 0.02092693, 0.0187986 , 0.01367969, 0.01269902])
pca.explained_variance_ratio_.sum()
np.float64(0.9119384457104507)
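The ~91% retained variance has a direct interpretation as reconstruction quality: the variance the kept components miss equals the relative mean-squared error of reconstructing the data with `inverse_transform`. A standalone sketch on synthetic (pre-centered) data:

```python
# Sketch: retained variance ratio + relative reconstruction MSE = 1.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
X = X - X.mean(axis=0)  # center so sums of squares match variances directly

pca = PCA(n_components=10).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))  # project down, then back up

retained = pca.explained_variance_ratio_.sum()
rel_mse = ((X - X_rec) ** 2).sum() / (X ** 2).sum()
print(retained, rel_mse)  # the two quantities sum to 1
```

So keeping 10 components at 91.2% explained variance means reconstructing `x_scaled` from them would lose about 8.8% of its (scaled) squared magnitude.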
The loadings show which original variables contribute most to each PC.
loadings = pd.DataFrame(
pca.components_.T,
columns=[f'PC{i+1}' for i in range(n_components)],
index=x.columns
)
pc_tables = {}
for i in range(10):
    pc_name = f'PC{i+1}'
    top_vals = loadings[pc_name].abs().sort_values(ascending=False).head(10)
    pc_tables[pc_name] = pd.DataFrame({
        'Variable': top_vals.index,
        'Loading Value': top_vals.values
    })
for pc, table in pc_tables.items():
    print(f"\n==== {pc} ====")
    display(table)
==== PC1 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_GAMBLING_SAVINGS | 0.977239 |
| 1 | R_EDUCATION_SAVINGS | 0.088294 |
| 2 | R_HEALTH_SAVINGS | 0.082452 |
| 3 | R_GAMBLING_INCOME | 0.076346 |
| 4 | R_GAMBLING_DEBT | 0.066045 |
| 5 | R_CLOTHING_SAVINGS | 0.059961 |
| 6 | R_UTILITIES_SAVINGS | 0.058493 |
| 7 | R_GROCERIES_SAVINGS | 0.057136 |
| 8 | R_ENTERTAINMENT_SAVINGS | 0.037196 |
| 9 | R_HEALTH_DEBT | 0.032289 |
==== PC2 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_EDUCATION_SAVINGS | 0.830026 |
| 1 | R_EDUCATION_DEBT | 0.388140 |
| 2 | R_GROCERIES_SAVINGS | 0.165471 |
| 3 | R_EDUCATION_INCOME | 0.143035 |
| 4 | R_HEALTH_SAVINGS | 0.121223 |
| 5 | R_GAMBLING_SAVINGS | 0.117376 |
| 6 | R_ENTERTAINMENT_SAVINGS | 0.115602 |
| 7 | R_UTILITIES_SAVINGS | 0.096593 |
| 8 | R_EXPENDITURE_SAVINGS | 0.080639 |
| 9 | R_TRAVEL_SAVINGS | 0.073878 |
==== PC3 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_HEALTH_DEBT | 0.430235 |
| 1 | R_CLOTHING_DEBT | 0.352193 |
| 2 | R_TAX_DEBT | 0.329979 |
| 3 | R_EDUCATION_DEBT | 0.328262 |
| 4 | R_UTILITIES_DEBT | 0.318823 |
| 5 | R_HOUSING_DEBT | 0.318532 |
| 6 | R_ENTERTAINMENT_DEBT | 0.268869 |
| 7 | R_GROCERIES_DEBT | 0.262130 |
| 8 | R_TRAVEL_DEBT | 0.192121 |
| 9 | R_EXPENDITURE_DEBT | 0.151929 |
==== PC4 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_EDUCATION_DEBT | 0.728950 |
| 1 | R_HOUSING_DEBT | 0.325328 |
| 2 | R_HEALTH_DEBT | 0.247563 |
| 3 | R_EDUCATION_SAVINGS | 0.236476 |
| 4 | R_TAX_DEBT | 0.174421 |
| 5 | INCOME | 0.109946 |
| 6 | R_GROCERIES_SAVINGS | 0.106388 |
| 7 | R_ENTERTAINMENT_SAVINGS | 0.098834 |
| 8 | T_TAX_12 | 0.098571 |
| 9 | T_TAX_6 | 0.093652 |
==== PC5 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_EDUCATION_SAVINGS | 0.374555 |
| 1 | R_ENTERTAINMENT_SAVINGS | 0.280735 |
| 2 | R_EDUCATION_DEBT | 0.239342 |
| 3 | T_HEALTH_12 | 0.239282 |
| 4 | INCOME | 0.236393 |
| 5 | R_TRAVEL_SAVINGS | 0.220699 |
| 6 | T_UTILITIES_12 | 0.205876 |
| 7 | T_UTILITIES_6 | 0.194033 |
| 8 | R_HEALTH_DEBT | 0.182444 |
| 9 | T_HEALTH_6 | 0.181538 |
==== PC6 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_HEALTH_SAVINGS | 0.415204 |
| 1 | R_UTILITIES_SAVINGS | 0.290061 |
| 2 | R_CLOTHING_SAVINGS | 0.284561 |
| 3 | R_GROCERIES_SAVINGS | 0.280974 |
| 4 | R_HOUSING_SAVINGS | 0.259285 |
| 5 | R_CLOTHING_DEBT | 0.253029 |
| 6 | R_TAX_DEBT | 0.231399 |
| 7 | R_HOUSING_DEBT | 0.199283 |
| 8 | R_ENTERTAINMENT_SAVINGS | 0.194951 |
| 9 | R_TAX_SAVINGS | 0.171425 |
==== PC7 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | T_TRAVEL_12 | 0.447831 |
| 1 | T_TRAVEL_6 | 0.328668 |
| 2 | R_HOUSING_DEBT | 0.326502 |
| 3 | R_HOUSING_SAVINGS | 0.242190 |
| 4 | R_TRAVEL_SAVINGS | 0.237960 |
| 5 | R_HEALTH_SAVINGS | 0.211043 |
| 6 | R_TRAVEL_INCOME | 0.210480 |
| 7 | R_UTILITIES_SAVINGS | 0.188626 |
| 8 | R_CLOTHING_DEBT | 0.182161 |
| 9 | R_EDUCATION_DEBT | 0.180139 |
==== PC8 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_HOUSING_DEBT | 0.714689 |
| 1 | T_TRAVEL_12 | 0.233464 |
| 2 | R_HEALTH_DEBT | 0.199545 |
| 3 | INCOME | 0.190557 |
| 4 | R_CLOTHING_DEBT | 0.168073 |
| 5 | T_TRAVEL_6 | 0.167848 |
| 6 | T_ENTERTAINMENT_12 | 0.160307 |
| 7 | R_HOUSING_SAVINGS | 0.157005 |
| 8 | T_ENTERTAINMENT_6 | 0.151220 |
| 9 | R_EDUCATION_DEBT | 0.143039 |
==== PC9 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_TAX | 0.520399 |
| 1 | T_TRAVEL_12 | 0.300750 |
| 2 | R_HEALTH_SAVINGS | 0.271992 |
| 3 | R_GAMBLING_DEBT | 0.228584 |
| 4 | R_HEALTH_DEBT | 0.227304 |
| 5 | T_TRAVEL_6 | 0.225651 |
| 6 | R_TAX_DEBT | 0.213570 |
| 7 | R_ENTERTAINMENT_SAVINGS | 0.196636 |
| 8 | R_ENTERTAINMENT_INCOME | 0.162187 |
| 9 | R_GAMBLING_INCOME | 0.154139 |
==== PC10 ====
| Variable | Loading Value | |
|---|---|---|
| 0 | R_TAX | 0.781085 |
| 1 | T_TRAVEL_12 | 0.298051 |
| 2 | T_TRAVEL_6 | 0.222763 |
| 3 | R_HEALTH_SAVINGS | 0.169571 |
| 4 | R_ENTERTAINMENT_INCOME | 0.141080 |
| 5 | R_UTILITIES_SAVINGS | 0.139574 |
| 6 | R_HOUSING_SAVINGS | 0.136124 |
| 7 | R_ENTERTAINMENT_SAVINGS | 0.128237 |
| 8 | R_TRAVEL_DEBT | 0.116832 |
| 9 | R_TRAVEL_INCOME | 0.115618 |
Interpretation for each PC¶
- PC1: Gambling vs Saving Intensity
- PC2: Education Spending vs Savings Capacity
- PC3: Debt Structure Across All Spending Categories
- PC4: Education & Housing Debt vs Income
- PC5: Health & Utilities Spending + Education Behavior
- PC6: Savings Ratios Across Categories
- PC7: Travel Expenditure Pattern + Housing Debt
- PC8: Housing Debt & Travel Spending Mix
- PC9: Tax Behavior and Travel & Health Spending
- PC10: Tax Burden and Travel Pattern
2D PCA Scatter Plot¶
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y_seg.map({'Poor':0,'Fair':1,'Good':2,'Excellent':3}), cmap='viridis')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter Plot by Credit Category")
plt.colorbar()
plt.show()
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_class, cmap='coolwarm', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by DEFAULT")
plt.colorbar()
plt.show()
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_reg, cmap='plasma', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by CREDIT_SCORE")
plt.colorbar()
plt.show()
The PCA plot (PC1 vs PC2) shows that the four credit categories, default status, and credit scores all overlap heavily; there are no clear clusters or trends. This means the main directions of variation in the data, driven mostly by spending and saving behavior, do not align with the credit categories.
Since PCA is unsupervised and focuses on overall variance, it is normal that the credit groups are not visually separated in the first two components.
With the PCA data, we can also work toward a new task: Customer Financial Behavior Segmentation.
PCA-based Dataset¶
df_pca = pd.DataFrame(x_pca, columns=[f'PC{i+1}' for i in range(10)])
df_pca["DEFAULT"] = y_class
df_pca["CREDIT_SCORE"] = y_reg
df_pca["CATEGORY"] = y_seg
df_pca.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | DEFAULT | CREDIT_SCORE | CATEGORY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.879585 | 3.634852 | -4.278596 | -1.702409 | -0.328213 | -2.611159 | 1.394502 | -1.163674 | 1.294963 | 1.602399 | 1 | 444 | Poor |
| 1 | -4.586699 | -1.677113 | -2.369564 | -0.077723 | -1.543576 | 1.377409 | 0.259867 | 1.753823 | 1.131481 | -0.182091 | 0 | 625 | Good |
| 2 | 10.270231 | -2.368764 | -2.994109 | -0.493575 | -2.943640 | 0.305558 | 0.860887 | 0.351740 | -1.246965 | 1.675060 | 1 | 469 | Poor |
| 3 | 26.529054 | -2.038775 | -1.483680 | 1.014450 | 0.699369 | -1.406972 | 1.236112 | 1.057955 | 1.737288 | 2.145868 | 0 | 559 | Fair |
| 4 | -4.534533 | -2.926804 | -3.447148 | 1.402831 | 1.338662 | -1.189801 | -0.270045 | 0.660445 | -2.096263 | -1.835176 | 0 | 473 | Poor |
Can we use the PCA data for customer financial behaviour segmentation?¶
from sklearn.cluster import KMeans
# Use PC1–PC10 for clustering
X_cluster = df_pca[[f'PC{i+1}' for i in range(10)]]
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(X_cluster)
# add cluster labels
df_pca['Cluster'] = clusters
df_pca['Cluster'].unique()
array([0, 1, 3, 2], dtype=int32)
df_pca.head()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | DEFAULT | CREDIT_SCORE | CATEGORY | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.879585 | 3.634852 | -4.278596 | -1.702409 | -0.328213 | -2.611159 | 1.394502 | -1.163674 | 1.294963 | 1.602399 | 1 | 444 | Poor | 0 |
| 1 | -4.586699 | -1.677113 | -2.369564 | -0.077723 | -1.543576 | 1.377409 | 0.259867 | 1.753823 | 1.131481 | -0.182091 | 0 | 625 | Good | 0 |
| 2 | 10.270231 | -2.368764 | -2.994109 | -0.493575 | -2.943640 | 0.305558 | 0.860887 | 0.351740 | -1.246965 | 1.675060 | 1 | 469 | Poor | 0 |
| 3 | 26.529054 | -2.038775 | -1.483680 | 1.014450 | 0.699369 | -1.406972 | 1.236112 | 1.057955 | 1.737288 | 2.145868 | 0 | 559 | Fair | 0 |
| 4 | -4.534533 | -2.926804 | -3.447148 | 1.402831 | 1.338662 | -1.189801 | -0.270045 | 0.660445 | -2.096263 | -1.835176 | 0 | 473 | Poor | 0 |
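The choice of `n_clusters=4` above was fixed a priori. One way to sanity-check it (not done in the original notebook) is to compare candidate k values by silhouette score. A standalone sketch on synthetic blobs; in the notebook, `X_cluster` (the PC1–PC10 matrix) would take the place of `X_demo`:

```python
# Sketch (assumption: model selection via silhouette score, on synthetic data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)  # higher = tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

A clear silhouette peak near k=4 on `X_cluster` would support the segmentation below; a flat curve would suggest the PCA space has no strong cluster structure.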
Compute PCA Component means per cluster¶
cluster_profile = df_pca.groupby('Cluster')[ [f'PC{i+1}' for i in range(10)] ].mean().round(2)
cluster_profile
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | ||||||||||
| 0 | -0.13 | 0.22 | -2.39 | 1.30 | 1.00 | 0.52 | 0.05 | 0.29 | 0.24 | -0.01 |
| 1 | 77.94 | -1.26 | 2.33 | -1.18 | -0.05 | 0.29 | -0.63 | 0.34 | -0.47 | 0.08 |
| 2 | -5.97 | 2.75 | 33.69 | -2.30 | 4.10 | 2.96 | 0.57 | -0.11 | -0.40 | 0.81 |
| 3 | -4.55 | -0.56 | 2.40 | -2.45 | -2.36 | -1.31 | -0.11 | -0.61 | -0.44 | -0.03 |
Cluster 0: Financially Stable, Low-Risk Behavior
- PC4 = 1.30 ← moderate long-term (education + housing) debt
- PC5 = 1.00 ← moderate essential expenses
- PC6 = 0.52 ← slight savings weakness
- PC2 = 0.22 ← slight education-related pressure
Cluster 1: Extreme Gambling & Financial Risk Behavior
- PC1 = 77.94 ← extremely high gambling-to-savings intensity
Cluster 2: High Debt Load & High Living Expense Pressure
- PC3 = 33.69 ← extremely high
- PC5 = 4.10 ← high health/utility/education burden
- PC6 = 2.96 ← moderate savings inefficiency
- PC2 = 2.75 ← education pressure
Cluster 3: Moderate Long-Term Debt, Moderate Essential Spending
- PC1 = –4.55 ← low gambling behavior
- PC3 = 2.40 ← moderate debt ratios, far below Cluster 2
- PC4 = –2.45 ← low long-term debt burden
- PC5 = –2.36 ← low essential spending pressure
- PC6 = –1.31 ← efficient savings